A COVID‐19 CXR image recognition method based on MSA‐DDCovidNet

Abstract Coronavirus disease 2019 (COVID‐19) has not yet been contained. Detecting infected persons from chest X‐ray (CXR) images with deep learning methods is a safe and effective approach. To this end, the dual‐path multi‐scale fusion (DMFF) module and the dense dilated depth‐wise separable (D3S) module are used to extract shallow and deep features, respectively. Based on these two modules and a multi‐scale spatial attention (MSA) mechanism, a lightweight convolutional neural network model, MSA‐DDCovidNet, is designed. Experimental results show that the accuracy of the MSA‐DDCovidNet model on COVID‐19 CXR images is as high as 97.962%. In addition, the proposed MSA‐DDCovidNet has lower computational complexity and fewer parameters. Compared with other methods, MSA‐DDCovidNet can help diagnose COVID‐19 more quickly and accurately.


INTRODUCTION
The 2019-nCoV is spreading at an extremely fast rate. Coronavirus disease 2019 (COVID-19), caused by 2019-nCoV, has put many countries and regions with scarce medical resources and low medical standards into difficulty. The most commonly used method for diagnosing COVID-19 is detection based on reverse transcriptase polymerase chain reaction (RT-PCR). It has high specificity, but the demand for detection kits keeps increasing [1]. In addition, its sensitivity is low, which makes it prone to false negative results. Therefore, a reliable and low-cost detection method should be sought. CXR is the most widely used imaging test for diagnosing heart and other chest diseases [2]. Compared with CT scans, CXR is more commonly used, and X-rays involve less ionizing radiation [3]. Detecting diseases from chest radiographs is an extremely challenging task that requires professional knowledge and careful observation. COVID-19 presents some radiological features that can be detected by CXR. However, if these features are analysed by manual film reading, the process not only takes up a lot of medical staff's time but is also prone to errors caused by visual fatigue and other disturbances. Therefore, it is necessary to find a way to automate COVID-19 detection from CXR images.
The purpose of this study is to find a lightweight and accurate method for automatically recognizing COVID-19 CXR images to assist medical staff in diagnosis. Since convolutional neural networks (CNNs) perform excellently in image recognition tasks, especially image classification, a CNN model is used to realize this method. To ensure that the model can accurately identify COVID-19 CXR images at low cost, depth-wise separable convolution [4], feature reuse and multi-scale feature fusion are fully exploited in the design of the network structure.
The remainder of the paper is arranged as follows: Section 2 discusses the related work of CNN image recognition and medical image recognition. Section 3 describes the structure of our proposed network and its modules. Section 4 shows the experimental dataset, parameter setting and experimental results and analyses the results in detail. Section 5 carefully analyses the advantages of the structure of MSA-DDCovidNet and the limitations of the study. Section 6 summarizes the paper and describes our prospects for the future of this study.

RELATED WORK
In recent years, deep learning has been widely used in medical image detection. For example, Wang W et al. [5] applied deep learning-based image classification to colonic polyps and proposed the improved approaches VGGNets-GAP and ResNets-GAP, which use global average pooling (GAP), to classify colonoscopy polyp images for assisted diagnosis. Inspired by DenseNet [6] and MobileNet [4], Wang W et al. [7] proposed Dense-MobileNet, which performed well on a children's colonoscopy polyp dataset. As a representative branch of deep learning technology, the convolutional neural network (CNN) has excellent performance in image feature extraction and learning [8]. Therefore, researchers recommend using deep learning technology to help detect lesion information in CXR images, save medical resources, and improve diagnostic efficiency. For example, Khan et al. [9] proposed CoroNet based on the structure of Xception [10], which achieved good performance on COVID-19 CXR image classification. Based on Xception [10] and ResNet50V2 [11], Rahimzadeh et al. [12] designed a network that improves performance by combining the output features of the two networks.

ARCHITECTURE DESIGN
CXR images of different classes are commonly very similar to each other, while CXR images within the same class vary little. This leads to model bias and overfitting, which reduces the performance and generalization of the model. Moreover, a CNN for mobile terminals needs few parameters and fast inference; otherwise it causes delays and undermines recognition efficiency. In response to the above problems, a new lightweight CNN, MSA-DDCovidNet, is proposed, based on the DMFF module, the D3S module and the multi-scale spatial attention (MSA) mechanism.

DMFF module and D3S module
The DMFF module and the D3S module are proposed by our team, and both are based on depth-wise separable convolution. They have high computational efficiency and strong representational capacity on shallow and deep feature maps, respectively. Their structure diagrams are shown in Figures 1 and 2. In Figures 1 and 2, H, W, and C denote the height, width, and number of channels of the feature maps, respectively; f is the number of convolution kernels, k is the size of the convolution kernel, and s is the step size (stride). Depth-wise separable convolution decomposes the convolution into two steps: depth-wise convolution and point-wise convolution. This decomposition greatly reduces the amount of computation and the number of model parameters. Applying h-swish can reduce latency [17], so h-swish is adopted as the activation function in the network. The DMFF module splits the input feature maps along the channel dimension into two branches. After increasing the channels with a point-wise convolution, one branch uses a dilated depth-wise convolution layer, that is, a depth-wise convolution layer whose kernel has an expansion rate of 2 instead of an ordinary kernel. The other branch uses an ordinary depth-wise convolutional layer after a point-wise convolution layer. Finally, the module concatenates the feature maps of the two branches along the channel dimension and produces the output after a channel shuffle [18] operation. Because the receptive fields of the two branches differ, the channel-wise concatenation realizes multi-scale feature fusion and enhances the spatial representational capacity of the model. Since a dilated convolution with an expansion rate of 2 does not increase the complexity of the model [11], the two branches have the same number of parameters and the same amount of computation.
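The statement that a dilated kernel enlarges the receptive field without adding weights can be checked with simple arithmetic. The following is a minimal sketch, not code from the original work: the function names are hypothetical, a square kernel without bias is assumed, and the 32-channel example is arbitrary.

```python
def effective_kernel_size(k: int, rate: int) -> int:
    """Spatial span covered by a k x k kernel whose taps are `rate` pixels apart."""
    return k + (k - 1) * (rate - 1)


def depthwise_weights(k: int, channels: int) -> int:
    """Weights of a depth-wise convolution: one k x k filter per channel, no bias."""
    return k * k * channels


# Hypothetical example: 3 x 3 depth-wise kernels over 32 channels.
ordinary_span = effective_kernel_size(3, 1)  # ordinary branch
dilated_span = effective_kernel_size(3, 2)   # branch with expansion (dilation) rate 2
weights = depthwise_weights(3, 32)           # the same for both branches

print(ordinary_span, dilated_span, weights)
```

Under these assumptions the dilated branch covers a 5 × 5 window with the same nine weights per channel as the ordinary 3 × 3 branch, which is why concatenating the two branches mixes two scales at equal cost.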
Since the features extracted by the convolutional layer close to the input contain detailed texture information, the DMFF module will be used in the shallow layers of the proposed network.
The D3S module is based on a dilated depth-wise separable convolutional layer and dense connection. The input feature maps pass through a dilated depth-wise separable convolutional layer, and the resulting feature maps are then concatenated channel-wise with the input feature maps to form the output of the module. Compared with standard convolution, the dilated depth-wise separable convolution has fewer parameters, less computation, and a larger receptive field, which makes the model more lightweight and efficient. The features extracted from the deep layers of the network are more critical for distinguishing heterogeneous samples, and feature reuse can alleviate information loss. Therefore, the D3S module is used in the deep layers of the proposed network.

Multi-scale spatial attention (MSA) mechanism
Inspired by Kim et al. [19], a novel multi-scale spatial attention (MSA) mechanism is proposed. Before being input to the fully connected layer, the feature maps are passed through the MSA mechanism, as shown in Figure 2. Let there be L successive D3S modules in the network. On the one hand, to obtain a spatial attention map, the feature maps output by the first DMFF module are fed to a global average pooling layer and a standard convolution layer, and the resulting feature maps are taken as the spatial attention map. On the other hand, three groups of feature maps containing semantic features of different depths are concatenated channel-wise, so that the resulting feature maps contain rich multi-scale deep features. These feature maps are multiplied by the spatial attention map to extract the key spatial information. Compared with a single-scale spatial attention mechanism, the MSA mechanism can capture feature information of different depths and has better spatial representational capacity.
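The fusion step described above (channel-wise concatenation followed by multiplication with a spatial attention map) can be illustrated on toy data. This is only a sketch under stated assumptions: the attention map is given directly rather than computed by pooling and convolution, all values are made up, and the helper names are hypothetical rather than taken from the original implementation.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def concat_channels(groups: List[FeatureMap]) -> FeatureMap:
    """Channel-wise concatenation of feature maps with equal height and width."""
    merged: FeatureMap = []
    for group in groups:
        merged.extend(group)
    return merged


def apply_spatial_attention(features: FeatureMap,
                            attention: List[List[float]]) -> FeatureMap:
    """Multiply every channel element-wise by a single H x W attention map."""
    return [
        [[value * attention[r][c] for c, value in enumerate(row)]
         for r, row in enumerate(channel)]
        for channel in features
    ]


# Toy 2 x 2 example: three single-channel groups stand in for the three depths.
group_a = [[[1.0, 2.0], [3.0, 4.0]]]
group_b = [[[5.0, 6.0], [7.0, 8.0]]]
group_c = [[[9.0, 10.0], [11.0, 12.0]]]
attention_map = [[1.0, 0.5], [0.0, 1.0]]  # assumed values, not learned weights

fused = concat_channels([group_a, group_b, group_c])
weighted = apply_spatial_attention(fused, attention_map)
print(len(weighted), weighted[0])
```

The same attention map reweights every channel, so positions with a weight near zero are suppressed across all three depths at once.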

The structure of MSA-DDCovidNet
The structure of MSA-DDCovidNet is shown in Figure 3. The input image is preprocessed before being input to the model. The first layer uses dilated convolution filters with an expansion rate of 2. The DMFF module is then applied five times to halve the spatial dimensions (height and width) of the feature maps, remove redundant information and compress the features. A depth-wise separable convolution layer follows to enrich the feature information. Next, nine successive D3S modules extract deep features and alleviate vanishing gradients. The MSA mechanism is then used to extract spatial-domain information from the multi-scale feature maps. After the global average pooling layer, the spatial size of the feature maps becomes 1 × 1. A point-wise convolution layer is then used to increase the feature dimension. Next, a fully connected layer is used to reduce the impact of feature coordinate information on classification. Finally, the SoftMax layer is used for classification.
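A rough shape walk-through may help in following the stages above. The sketch below rests on several assumptions that the text does not confirm: the feature maps entering the first DMFF module are taken to be 224 × 224 with 16 channels (both hypothetical), the intermediate depth-wise separable layer is assumed to leave the shape unchanged, and each DMFF or D3S module is assumed to add 16 channels, as in the dimension conversions of the network complexity analysis.

```python
def trace_shapes(size: int, channels: int, n_dmff: int = 5,
                 n_d3s: int = 9, growth: int = 16):
    """Return (stage, height_or_width, channels) tuples for a simplified walk-through."""
    shapes = [("input to first DMFF", size, channels)]
    for i in range(1, n_dmff + 1):
        size //= 2          # each DMFF module halves height and width
        channels += growth  # and widens the feature maps by `growth` channels
        shapes.append((f"DMFF {i}", size, channels))
    for i in range(1, n_d3s + 1):
        channels += growth  # each D3S module concatenates `growth` new channels
        shapes.append((f"D3S {i}", size, channels))
    return shapes


# Hypothetical starting point: 224 x 224 feature maps with 16 channels.
shapes = trace_shapes(224, 16)
for stage, side, chans in shapes:
    print(f"{stage:>20}: {side} x {side} x {chans}")
```

With these assumed starting values, the five halvings reduce 224 to 7, and the dense concatenation in the D3S stage grows the channel count linearly rather than multiplicatively.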

Network complexity
In this work, the amount of computation and the number of parameters are adopted to measure the complexity of the model. The parameters are generated by the weight layers of the CNN, which mainly include convolution layers and fully connected layers. The amount of computation is measured in floating-point operations (FLOPs). Every operation in the network contributes computation, even a simple element-wise addition. The parameters and the amount of computation of a model mainly depend on its depth and width, the resolution of the input images and the structure of the model.
For a given input feature map $H_i \cdot W_i \cdot C_i$ and output feature map $H_o \cdot W_o \cdot C_o$, the parameters $P_{std}$ and the amount of computation $F_{std}$ produced by a standard convolution are as follows:
Since a dilated convolution with an expansion rate of 2 does not increase parameters or computation, the parameters $P_{dw}$ and the amount of computation $F_{dw}$ generated by a depth-wise convolution and by a depth-wise dilated convolution with an expansion rate of 2 are as follows:
For a given input feature map $H \cdot W \cdot C$ and output feature map $(H/2) \cdot (W/2) \cdot (C + 16)$, the parameters $P_{conv3\_1}$ and the amount of computation $F_{conv3\_1}$ generated by a standard convolution with kernel size 3 × 3 are as follows:
When the DMFF module is used to complete the above dimension conversion, the parameters $P_{DMFF}$ and the amount of computation $F_{DMFF}$ generated by a DMFF module are as follows:
Therefore, compared with a standard convolution, the reductions in parameters $\Delta_{DMFF\_P}$ and computation $\Delta_{DMFF\_F}$ achieved by the DMFF module are as follows:
Similarly, for a given input feature map $H \cdot W \cdot C$ and output feature map $H \cdot W \cdot (C + 16)$, the parameters $P_{conv3\_2}$ and the amount of computation $F_{conv3\_2}$ generated by a standard convolution with kernel size 3 × 3 are as follows:
When the D3S module is used to complete the above dimension conversion, the parameters $P_{D3S}$ and the amount of computation $F_{D3S}$ generated by a D3S module are as follows:
Therefore, compared with a standard convolution, the reductions in parameters $\Delta_{D3S\_P}$ and computation $\Delta_{D3S\_F}$ achieved by the D3S module are as follows:
Since $\Delta_{DMFF\_P} > 0$, $\Delta_{DMFF\_F} > 0$, $\Delta_{D3S\_P} > 0$ and $\Delta_{D3S\_F} > 0$, both the DMFF module and the D3S module contribute to reducing the parameters and the computation.
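To make the comparison concrete, the following sketch counts weights and multiply-accumulate operations with the usual textbook formulas: a standard convolution has k·k·C_in·C_out weights, and a depth-wise convolution has k·k·C weights. The conventions are assumptions, since the source does not state them: biases are ignored and one multiply-accumulate counts as one operation. The "D3S-like" block (a 3 × 3 depth-wise convolution followed by a point-wise convolution to 16 channels, concatenated with the input) is a hypothetical simplification rather than the authors' exact module, and the 14 × 14 × 64 input is arbitrary.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_params(k: int, channels: int) -> int:
    """Weights of a depth-wise k x k convolution; dilation does not change this."""
    return k * k * channels


def ops(params: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulates: every weight is applied once per output position."""
    return params * h_out * w_out


# Hypothetical input of 14 x 14 x 64 mapped to 14 x 14 x (64 + 16).
h, w, c, growth = 14, 14, 64, 16

p_std = conv_params(3, c, c + growth)                       # plain 3 x 3 convolution
p_sep = depthwise_params(3, c) + conv_params(1, c, growth)  # assumed D3S-like block

f_std = ops(p_std, h, w)
f_sep = ops(p_sep, h, w)

print(p_std, p_sep, p_std - p_sep)
print(f_std, f_sep, f_std - f_sep)
```

Under these assumptions both differences are positive for any C ≥ 1, which agrees with the sign of the reductions stated above.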
The complexity of the MSA mechanism is analysed next. For three sets of input feature maps with shapes $H \cdot W \cdot C$, $H \cdot W \cdot (C + 16)$ and $H \cdot W \cdot (C + 32)$, the output feature map $H \cdot W \cdot (3C + 48)$ and the shallow feature map $H_1 \cdot W_1 \cdot C_1$, the parameters $P_{MSA}$ and the amount of computation $F_{MSA}$ generated by the MSA mechanism are as follows:

Dataset
Two different datasets were used in this study. The first dataset is used in the comparative experiment between MSA-DDCovidNet and several state-of-the-art CNNs. Its CXR images come from two sources: the Kaggle CXR dataset [20] (https://www.kaggle.com/paultimothymooney/chest-xraypneumonia) and the dataset collected by Joseph et al. [21].
The Kaggle CXR dataset has a total of 5863 images, including pneumonia and normal CXR images. In the following section, the COVIDx dataset [22] is adopted to verify the performance of MSA-DDCovidNet on other CXR image datasets. The COVIDx dataset is built according to the dataset generation method provided by Wang et al. [22], which yields 589 COVID-19 images, 8851 normal images and 6053 pneumonia images. Similar to the method of Nihad et al. [23], 100 COVID-19 images, 885 normal images, and 594 pneumonia images in COVIDx are randomly selected as the test set, and the remaining images form the training set. Figure 4 shows examples of the various CXR images in the experimental dataset of this work. They reflect the high inter-class similarity and low intra-class variance of CXR images, which increase the difficulty of the CXR image classification task.

The evaluation criteria of the model
For model evaluation, we follow the criteria adopted by most medical image classification models: accuracy, precision, sensitivity, specificity, F1-score, the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The formulas for some of these criteria, in their standard form, are as follows:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \]
\[ \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \]
In these equations, TP denotes true positive, FP denotes false positive, FN denotes false negative, and TN denotes true negative.
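The criteria listed above can be computed directly from the four confusion-matrix counts of one class. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard one-vs-rest criteria computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a three-class problem such as this one, the counts are taken per class in a one-vs-rest fashion and the per-class values are then averaged.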

Preprocessing and parameter settings
Since model training requires sufficient data samples, data augmentation techniques are used in this work. First, the CXR images are rescaled to a fixed size of 256 × 256, and a centre crop is applied to obtain a size of 224 × 224. Then a series of data augmentation operations is applied to the training set: the CXR images are flipped horizontally with a probability of 0.5, and their brightness, contrast, and saturation are randomly scaled by a factor of 0.6 to 1.4. With data augmentation, the number of samples used for training is in effect four times the size of the training set. All experiments are conducted in the same configuration environment. The software platform and hardware environment are shown in Table 1. The training strategy was determined after many experiments. The initial learning rate of the experimental models was set to 0.001. Each group of experiments was trained for 150 epochs, and the loss function was the cross-entropy loss with label smoothing regularization [24] with epsilon = 0.1. The Adam [25] optimizer with betas = (0.9, 0.999) is used to make the model converge quickly. The batch sizes of the training set and test set are 32 and 16, respectively.
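As an illustration of the loss described above, the sketch below computes a label-smoothed cross-entropy for a single sample in plain Python. It assumes the common formulation in which epsilon is spread uniformly over all K classes; the function name and the logits are hypothetical, and the authors' actual implementation may differ.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(logits: Sequence[float], target: int,
                           epsilon: float = 0.1) -> float:
    """Cross-entropy against a smoothed target: (1 - eps) * one_hot + eps / K."""
    k = len(logits)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    loss = 0.0
    for index, logit in enumerate(logits):
        smoothed = epsilon / k + (1.0 - epsilon) * (1.0 if index == target else 0.0)
        loss -= smoothed * (logit - log_norm)  # logit - log_norm is the log-probability
    return loss


# Hypothetical logits for the three classes (COVID-19, normal, pneumonia).
confident = smoothed_cross_entropy([4.0, 0.5, 0.2], target=0)
plain = smoothed_cross_entropy([4.0, 0.5, 0.2], target=0, epsilon=0.0)
print(round(confident, 4), round(plain, 4))
```

With epsilon = 0 the function reduces to ordinary cross-entropy. With epsilon > 0 a small part of the target mass moves to the other classes, which penalizes over-confident predictions.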

Experimental results and analysis
To illustrate the lightweight design and classification performance of our proposed model, several state-of-the-art models are used as the control group in the experiments, such as VGG19 [26], GoogLeNet [27], ResNet50 [28] and DenseNet121 [6]. The control group also contains various lightweight networks such as SqueezeNet1.0 [29], ShuffleNet [30], MobileNetV2 [18] and ShuffleNetV2 [31]. The performance of the above models is shown in Table 2. As can be seen from Table 2, the classification accuracy, precision, sensitivity, specificity and F1 score of MSA-DDCovidNet are 97.96%, 98.09%, 98.07%, 98.33% and 98.07%, respectively. Every criterion value of our proposed network is better than that of the other networks. For example, ResNet50 [28] is the traditional network with the highest accuracy in our experiment, at 93.53%. However, this is still 4.43 percentage points lower than the proposed network.
In terms of network complexity, Table 3 shows that MSA-DDCovidNet has fewer parameters and less computation than the other methods. The lightweight networks ShuffleNet [30] and SqueezeNet1.0 [29] have, respectively, the least computation and the fewest parameters in the control group. However, they are still not as lightweight as our network, and their classification performance is also far below that of our network. Moreover, as shown in Table 3, the parameters and the amount of calculation of ResNet50 [28] are 54.68 and 43.21 times those of ours, respectively, so it is far less lightweight than MSA-DDCovidNet. Figure 5 shows the confusion matrix of MSA-DDCovidNet on the test set. As can be seen from Figure 5, the sensitivity for COVID-19 is 95.10%, with 97 of 102 tested images detected. In addition, the correct detection rate of the Normal class is 98.29%, and that of the Pneumonia class is 98.46%. Based on this confusion matrix, the values of the various criteria of MSA-DDCovidNet are calculated, as shown in Table 4. The weighted average precision, sensitivity, and specificity of MSA-DDCovidNet are all higher than 97%, at 97.95%, 97.93% and 98.23%, respectively. More notably, the precision and specificity of MSA-DDCovidNet for COVID-19 reach 100%. Since the baseline sensitivity of COVID-19 CXR images is 69% [32], this indicates that our proposed network can effectively improve the diagnostic efficiency of COVID-19.
In addition, some deep learning methods for detecting COVID-19 in CXR images are compared with MSA-DDCovidNet, as shown in Table 5. DarkCovidNet [14] has the fewest parameters among the five comparison models, but this is still 2.68 times as many as MSA-DDCovidNet, and its classification accuracy is 10.94 percentage points lower than that of MSA-DDCovidNet. ECOVNet-Soft [23] has the highest accuracy among the five comparison models, which is still 2.26 percentage points lower than that of our proposed network, and its parameter count is 12.146 times that of our proposed network. Therefore, considering both network performance and complexity, our proposed network is a recommendable intelligent method for recognizing CXR images of COVID-19.
The results of these methods were obtained on different datasets. If the methods are evaluated on the same dataset, the performance differences will be more intuitive and convincing. To further verify the effectiveness of MSA-DDCovidNet, a supplementary experiment is conducted on the COVIDx [22] dataset: the performance of the six models in Table 5 on COVIDx [22] is observed and compared under the experimental environment and parameter settings of this study (see Section 4.3 for details). The results are shown in Table 6.
As shown in Table 6, CoroNet, proposed by Khan et al. [9], outperforms the other models in all criteria. Based on Xception [10], CoroNet [9] adopts depth-wise separable convolution instead of standard convolution to reduce the parameters of the model. However, the large depth and width of the network result in a large number of parameters. Covid-Net [22] makes full use of point-wise convolution and depth-wise separable convolution in the PEPX module, which effectively reduces the parameters, and obtains better performance with fewer parameters. XResNet50V2 [12] by Rahimzadeh et al. contains two parallel sub-networks, Xception [10] and ResNet50V2 [11], and adopts a fully connected layer to classify the features extracted by these two sub-networks, which produces a large number of parameters. Moreover, its complex structure makes it difficult to optimize. As a result, it needs more parameters but does not achieve good performance. The structure of DarkCovidNet [14] is similar to VGGNet [26], consisting of standard convolutional layers, max pooling layers and fully connected layers. It has few parameters and low depth and width, which makes it difficult to learn a relatively large dataset such as COVIDx. Therefore, DarkCovidNet [14] performs poorly in this experiment. After the experimental preprocessing, the CXR images in COVIDx are resized to 224 × 224. For better comparison, the ECOVNet-Soft in this experiment is based on the EfficientNet-b0 model rather than the original EfficientNet-b5. The ECOVNet-Soft obtained in this way is a relatively lightweight network, and its performance in this experiment differs slightly from that in the original paper [23]. Such a difference is considered reasonable given the difference in hardware devices. MSA-DDCovidNet is the model with the fewest parameters in the experiment. Owing to depth-wise separable convolution, feature reuse and multi-scale feature fusion, it still performs well in this experiment.
From a comprehensive point of view, although CoroNet [9] and Covid-Net [22] achieve better performance with sophisticated designs, their parameters are more than 28 times those of MSA-DDCovidNet. Moreover, MSA-DDCovidNet performs better than more complex models such as XResNet50V2 [12], DarkCovidNet [14] and ECOVNet-Soft [23]. The ROC curve is considered an effective evaluation method that reflects the classification performance of a model, as it shows the trade-off between the true positive rate and the false positive rate. Figure 6 shows the ROC curves of the six models. The labels in Figure 6 show the micro-average, macro-average and class-wise AUC scores.
The comparison results in Figure 6 are similar to those in Table 6. Both CoroNet [9] and Covid-Net [22] have better ROC curves and AUC values, and the performance of MSA-DDCovidNet is second only to these two networks. Figure 6 also shows that the three underperforming networks (XResNet50V2, DarkCovidNet, and ECOVNet-Soft) have poor classification capabilities for COVID-19. Under the same experimental settings as the other models, DarkCovidNet underperforms. The intuitive explanation is that its low depth and width make it difficult to detect the relatively few COVID-19 CXR images among the numerous CXR images. In contrast, MSA-DDCovidNet achieves relatively good performance with fewer parameters. In summary, MSA-DDCovidNet is a network worth applying to CXR image recognition.
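For readers unfamiliar with the AUC, the sketch below computes it for a binary (one-vs-rest) problem from its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the scores are hypothetical, and this is not the evaluation code used in the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores for four samples, two negative and two positive.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(auc)
```

A class-wise AUC treats one class as positive and the remaining classes as negative; macro-averaging then takes the mean of these class-wise values.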

DISCUSSION
In order to verify that the multi-scale spatial attention mechanism is better than the traditional spatial attention mechanism, a network SSA-DDCovidNet is designed as the control group.
In SSA-DDCovidNet, the attention mechanism of MSA-DDCovidNet is replaced with the traditional single-scale spatial attention mechanism. Figure 7 shows the accuracy curves of the two networks on the experimental dataset. As can be seen from Figure 7, the average accuracy of the proposed network over 150 epochs is higher than that of SSA-DDCovidNet, and its highest accuracy is 2.03% higher than that of SSA-DDCovidNet. An additional experiment is conducted to verify where the spatial attention map should be obtained. Two comparison networks are designed for this experiment: D3S9Net and DMFF5Net. In MSA-DDCovidNet, the output feature map of the 1st DMFF module is used to generate the spatial attention map, while in D3S9Net, the output feature map of the 9th D3S module is used. Similarly, in DMFF5Net, the output feature map of the 5th DMFF module is used. The comparison is shown in Figure 8. Our interpretation of this result is that the feature map loses some spatial information in each down-sampling step. Since the features in the shallow feature map have not been compressed many times, they are relatively complete. Therefore, it is more reasonable to obtain the spatial attention map in the shallow layers of the network.
As a lightweight network, MSA-DDCovidNet benefits greatly from its structure. However, its performance still lags behind some sophisticated and highly complex networks, and the model needs further study and improvement. In future work, MSA-DDCovidNet will be rescaled to use more parameters for better performance while keeping the network lightweight.

CONCLUSION
In this paper, to recognize COVID-19 CXR images effectively, two kinds of feature sensitive modules proposed by our team are used: DMFF module and D3S module. Based on these two modules and MSA mechanism, we proposed MSA-DDCovidNet with strong spatial representation capacity and few parameters. To verify the performance of our proposed network, two datasets are adopted. In the preliminary experiment, 4265 CXR images of pneumonia patients, 1575 normal CXR images and 412 CXR images of COVID-19 patients are selected from two datasets. The performance of our network is compared with a series of other networks through experiments.
The results of the preliminary experiment show that MSA-DDCovidNet performs excellently, with a classification accuracy of 97.96% on the test set. More notably, its precision, sensitivity and specificity for COVID-19 are 100%, 95.10% and 100%, respectively. In addition, the larger COVIDx dataset is adopted to verify the performance of MSA-DDCovidNet: an additional experiment compares MSA-DDCovidNet with several other deep learning models, and MSA-DDCovidNet performs well. Two additional ablation experiments are also conducted to verify the effectiveness of the MSA mechanism. Therefore, it is believed that using MSA-DDCovidNet to detect COVID-19 in CXR images can effectively improve diagnostic efficiency and help detect and isolate patients in time. Due to the shortage of COVID-19 CXR images, it is necessary to collect more of them to better illustrate the effectiveness of our proposed network. Although MSA-DDCovidNet performed very well in the experiments, it still needs further clinical research and testing. After further training and testing, MSA-DDCovidNet is expected to be put into practical application in the auxiliary diagnosis of COVID-19.